[MoE Refactor][14/N] Clean Up FI Quant Config Smuggling #31593
robertgshaw2-redhat merged 62 commits into main
Conversation
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Code Review
This pull request refactors the quantization configuration for FlashInfer experts, removing a hack and improving code clarity. The changes appear correct and align with the goal of cleaning up the implementation. However, the PR includes several debugging statements (e.g., logger.info, print, timing code) across multiple files. These should be removed before merging to avoid polluting the logs and degrading performance. Additionally, there is a critical issue in vllm/model_executor/models/llama4.py where a logging statement references a variable defined in a commented-out block, which will cause a NameError at runtime.
vllm/model_executor/models/llama4.py (524-539)
This block contains commented-out debugging code and an active logging statement that references start, a variable defined within the commented-out section. This will lead to a NameError at runtime. The entire block should be removed.
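For illustration, the failure mode flagged here can be reproduced with a minimal sketch (the function and its body are hypothetical; only the variable name `start` matches the code under review). Commenting out the assignment while leaving the log line active makes `start` an unbound name, so the error surfaces at call time, not import time.

```python
import time

def forward_with_stale_log():
    # The timing setup was commented out during cleanup...
    # start = time.perf_counter()
    result = sum(range(10))
    # ...but this log line still references `start`, so the first
    # call raises NameError at runtime rather than at import time.
    print(f"forward took {time.perf_counter() - start:.3f}s")
    return result
```

This is why removing the whole block, rather than just the assignment, is the safe fix.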
vllm/model_executor/layers/fused_moe/layer.py (954-956)
This logging statement appears to be for debugging purposes and should be removed before merging.
vllm/model_executor/layers/fused_moe/layer.py (1012-1021)
This block of code, including the time import and performance logging, seems to be for debugging and should be removed.
vllm/model_executor/layers/fused_moe/layer.py (1274)
This logger.info call appears to be for debugging. Please remove it, along with similar debugging logs at lines 1292, 1304, and the commented-out log at line 1366.
vllm/model_executor/models/mllama4.py (1126)
This print statement appears to be for debugging. Please remove it, along with the other debug prints in this function at lines 1130, 1136, and 1141.
vllm/model_executor/models/utils.py (198)
This logger.info call appears to be for debugging. Please remove it, along with the other debug logs added in this file (lines 219, 255, 264, 292, 302, 334).
Now working e2e with FI CUTLASS; need to address a few more nits for FlashInfer TRTLLM.
…thub.com/vllm-project/vllm into fix-flashinfer-experts-quant-config-hack
Just reran the quality checks on top of the head after the nits; accuracy still looks good.
pavanimajety left a comment:
Thank you for making the changes.
Thanks for your great feedback and review @pavanimajety!
w2_scale=layer.w2_weight_scale,
a1_scale=layer.w13_input_scale,
a2_scale=layer.w2_input_scale,
a1_gscale=(1.0 / layer.w13_input_scale),
I think this function is called every forward, which means these two lines will result in two kernel launches for the reciprocal:
a1_gscale=(1.0 / layer.w13_input_scale),
a2_gscale=(1.0 / layer.w2_input_scale),
Can we add these two scales in process_weights_after_loading?
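A minimal sketch of the suggestion, with hypothetical class and attribute names and plain floats standing in for the GPU tensors vLLM actually uses: compute the reciprocals once in process_weights_after_loading and cache them on the layer, so the hot path only reads attributes instead of launching reciprocal kernels.

```python
from types import SimpleNamespace

class QuantMethodSketch:
    """Hypothetical illustration (not vLLM's actual class) of moving the
    reciprocal computation out of the per-call path."""

    def process_weights_after_loading(self, layer):
        # Done once at load time -> one reciprocal computation total,
        # instead of two kernel launches on every call.
        layer.w13_input_gscale = 1.0 / layer.w13_input_scale
        layer.w2_input_gscale = 1.0 / layer.w2_input_scale

    def apply(self, layer):
        # The hot path only reads precomputed attributes.
        return {
            "a1_gscale": layer.w13_input_gscale,
            "a2_gscale": layer.w2_input_gscale,
        }
```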
It's not called in the forward pass. I recognize this is confusing, but the apply() method is not called during the forward pass for the FlashInfer kernels. When the FlashInfer CUTLASS kernels are selected, the FpMoeMethod is converted into a ModularKernelMethod.
I am working on an ongoing refactor that makes this conversion explicit:
- This PR implements it for FP8: [MoE Refactor][15/N] Apply Refactor to Fp8 #31415
- This PR implements it for NVFP4: [MoE Refactor][16/N] Apply Refactor to NVFP4 #31692
see https://vllm-dev.slack.com/archives/C08NFPURQ1F/p1767650816469009 for more details on my efforts
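The conversion described above can be illustrated with a self-contained sketch (all names except apply() are hypothetical stand-ins, not vLLM's actual classes): apply() runs once when the modular kernel is constructed, and the forward path reuses the captured arguments, which is why per-call reciprocal launches are not a concern here.

```python
from types import SimpleNamespace

class CountingQuantMethod:
    """Stand-in quant method; counts apply() calls to show the method
    is not invoked per forward."""
    def __init__(self):
        self.apply_calls = 0

    def apply(self, layer):
        self.apply_calls += 1
        return {"a1_gscale": 1.0 / layer.w13_input_scale}

class ModularKernelMethodSketch:
    """Hypothetical sketch of the conversion: apply() runs once at
    construction, and forward() reuses the captured arguments."""
    def __init__(self, quant_method, layer):
        self.kernel_kwargs = quant_method.apply(layer)  # once, at setup

    def forward(self, x):
        # Real code would launch the FlashInfer CUTLASS kernel here;
        # scaling the input stands in as a placeholder.
        return x * self.kernel_kwargs["a1_gscale"]
```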
…#31593) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>
…#31593) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Test Plan
Test Result
Llama4 Scout
Qwen3-30B
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.